Chapter 5 - New Developments: Topic Modeling with BERTopic!#

2022 July 30

What is BERTopic?#

  • As part of NLP analysis, it’s likely that at some point you will be asked, “What topics are most common in these documents?”

    • Though related, this question is definitely distinct from a query like “What words or phrases are most common in this corpus?”

      • For example, the sentences “I enjoy learning to code.” and “Educating myself on new computer programming techniques makes me happy!” contain wholly unique tokens, but encode a similar sentiment.

      • If possible, we would like to extract generalized topics instead of specific words/phrases to get an idea of what a document is about.

  • This is where BERTopic comes in! BERTopic is a topic modeling technique that combines transformer-based document embeddings (from the BERT family of models) with other ML tools for dimensionality reduction (UMAP) and clustering (HDBSCAN) to provide a flexible and powerful topic modeling module (with great visualization support as well!).

  • In this notebook, we’ll go through the operation of BERTopic’s key functionalities and present resources for further exploration.
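To make the token-versus-topic distinction concrete, here is a minimal sketch that checks how many word tokens the two example sentences above actually share. The `tokenize` helper is a hypothetical, deliberately simple tokenizer written for this illustration; it is not part of BERTopic.

```python
import re


def tokenize(text):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


sentence_a = "I enjoy learning to code."
sentence_b = "Educating myself on new computer programming techniques makes me happy!"

# Words that appear in both sentences.
shared_tokens = tokenize(sentence_a) & tokenize(sentence_b)
print(shared_tokens)  # set()
```

A word-count approach sees no overlap at all between these sentences, which is why an embedding-based method is attractive for grouping them under one topic.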

Required installs:#

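# Installs the base bertopic module:
!pip install bertopic

# Other transformers/language backends may require additional installs.
# Quote the package spec so that shells such as zsh do not treat the
# square brackets as a glob pattern ("no matches found"):
!pip install "bertopic[flair]" # can substitute 'flair' with 'gensim', 'spacy', 'use'

# bertopic also comes with its own handy visualization suite, built on plotly.
# plotly is already a dependency of the base install, so on recent releases
# this extra may be unnecessary:
!pip install "bertopic[visualization]"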

Data sourcing#

  • For this exercise, we’re going to use a popular data set, ‘20 Newsgroups,’ which contains ~18,000 newsgroup posts on 20 topics. This dataset is readily available to us through scikit-learn:

import bertopic
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

documents = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

print(documents[0]) # Any ice hockey fans? 
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!

Creating a BERTopic model:#

  • Using the BERTopic module requires you to create an instance of the model. When doing so, you can specify a number of parameters, including:

    • language -> the language of your documents

    • min_topic_size -> the minimum number of documents in a topic; increasing this value will lead to a lower number of topics

    • embedding_model -> which model to use to embed your documents; many are supported!

Example instantiation:#

from sklearn.feature_extraction.text import CountVectorizer 

# example parameter: a custom vectorizer model can be used to remove stopwords from the documents: 
stopwords_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english') 

# instantiating the model: 
model = BERTopic(vectorizer_model = stopwords_vectorizer)
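To see what this vectorizer configuration does on its own, the following small sketch fits the same settings to a single sentence and lists the resulting vocabulary. The name `demo_vectorizer` and the demo sentence are assumptions for illustration; only scikit-learn is needed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as above: unigrams and bigrams, English stop words removed.
demo_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
demo_vectorizer.fit(["I enjoy learning to code."])

# Vocabulary learned from the single demo sentence, in alphabetical order.
vocabulary = sorted(demo_vectorizer.vocabulary_)
print(vocabulary)
# ['code', 'enjoy', 'enjoy learning', 'learning', 'learning code']
```

Stop words such as "to" are dropped before bigrams are formed, so two-word phrases can appear in topic descriptions alongside single words.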

Fitting the model:#

  • The first step of topic modeling is to fit the model to the documents:

topics, probs = model.fit_transform(documents)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
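The warning above is harmless, and it names an environment variable that silences it. A minimal way to follow that suggestion from Python, assuming it is run before the model is fitted, is:

```python
import os

# Must be set before the tokenizers library is first used in this process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to "false" matches what the library falls back to anyway after the fork, so only the message goes away.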
  • .fit_transform() returns two outputs:

    • topics contains the topic (alternatively, cluster) assigned to each input document

    • probs contains, for each input, the probability that it belongs to its assigned topic

  • Note: fit_transform() can be substituted with fit(), which fits the model in the same way but returns the fitted model rather than topics and probs. To assign topics to new, unseen documents after fitting, use transform().
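Because topics holds one topic number per document, it can be summarized with the standard library alone. The sketch below uses a short, made-up list in place of the real output; `example_topics` is a hypothetical stand-in, not data from this notebook.

```python
from collections import Counter

# Hypothetical stand-in for the `topics` list returned by fit_transform():
# one topic number per input document.
example_topics = [0, 1, 0, 2, 0, 1, 3, 0]

# Number of documents assigned to each topic, most frequent first.
topic_counts = Counter(example_topics)
print(topic_counts.most_common())
# [(0, 4), (1, 2), (2, 1), (3, 1)]
```

Running the same two lines on the real topics list gives a quick sanity check before turning to BERTopic's own reporting methods.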

Viewing topic modeling results:#

  • The BERTopic module has many built-in methods to view and analyze your fitted model topics. Here are some basics:

# view your topics: 
topics_info = model.get_topic_info()

# get detailed information about the top five most common topics: 
print(topics_info.head(5))
   Topic  Count                                       Name
0     -1   6646                     -1_file_use_need_using
1      0   1838                0_team_games_players_season
2      1    616              1_clipper_encryption_chip_nsa
3      2    527  2_cheek ken_ken huh_ignore art_huh ignore
4      3    452          3_israel_israeli_jews_palestinian
  • When examining topic information, you may see a topic with the assigned number ‘-1.’ Topic -1 collects all outlier inputs that were not assigned to any topic and should typically be ignored during analysis.

  • Forcing documents into a topic could decrease the quality of the topics generated, so it’s usually a good idea to allow the model to place such inputs in this ‘Topic -1’ bin.
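One simple way to set the outlier bin aside is to filter it out of the topic table before further analysis. The sketch below rebuilds the five rows printed above as a small pandas DataFrame (`topics_info_example` is a hypothetical name) so that it runs without a fitted model; with a real model the same filter can be applied to the result of get_topic_info().

```python
import pandas as pd

# Rows copied from the get_topic_info() output shown above.
topics_info_example = pd.DataFrame(
    {
        "Topic": [-1, 0, 1, 2, 3],
        "Count": [6646, 1838, 616, 527, 452],
        "Name": [
            "-1_file_use_need_using",
            "0_team_games_players_season",
            "1_clipper_encryption_chip_nsa",
            "2_cheek ken_ken huh_ignore art_huh ignore",
            "3_israel_israeli_jews_palestinian",
        ],
    }
)

# Keep only rows for real topics, dropping the outlier bin.
assigned = topics_info_example[topics_info_example["Topic"] != -1]
print(assigned)
```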

# access a single topic: 
print(model.get_topic(topic=0)) # .get_topics() accesses all topics
[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]
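The output above is a list of (word, score) pairs ordered from highest to lowest score. As a small sketch of how to work with that structure, the snippet below takes the first few pairs from the output and joins the top four words into a label; the result happens to match the Name shown for topic 0 in the topic table earlier. The variable names are assumptions for illustration.

```python
topic_id = 0

# First five (word, score) pairs copied from the get_topic(0) output above.
topic_words = [
    ("team", 0.007645058778587724),
    ("games", 0.006112662299637617),
    ("players", 0.005412026399964582),
    ("season", 0.005342811826876292),
    ("hockey", 0.005239065199444112),
]

# Keep only the words of the four highest-scoring pairs.
top_words = [word for word, _score in topic_words[:4]]
label = f"{topic_id}_" + "_".join(top_words)
print(label)  # 0_team_games_players_season
```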
# get representative documents for a specific topic: 
print(model.get_representative_docs(topic=0)) # omit the 'topic' parameter to get docs for all topics 
["\ni have no idea, nor do i care.  however, i'd like to point out that\nblomberg got the first plate appearance by a designated hitter, and\nthe first walk by a designated hitter.  i am not sure, but i do not\nthink that he also got the first hit by a designated hitter.", ": >\n: >ATLANTIC DIVISION\n: >\t\n: >\tST JOHN'S MAPLE LEAFS VS MONCTON HAWKS\n: >\tMONCTON HAWKS\n: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,\n: >defensive, good goaltending. John Leblanc and Stu Barnes are the only\n: >noticable guns on the team. But the defense is top notch and \n: >Mike O'Neill is the most underrated goalie in the league.\n: >\n\n: Bri, as I have tried to tell you since 2 February, Michael O'Neill\n: might be the most underrated goalie in the AHL, but he ISN'T in the\n: AHL.  He's on the Winnipeg Jets' injury list, as he has been since\n: his first NHL start against the Ottawa Senators.  He's out until\n: next year after surgery to repair a shoulder separation.\n\n: Stu Barnes might be an AHL gun for the Hawks, but he's now the third\n: line center with the Jets, and has been since mid January or so.\n\nSorry, my memory is gone. I thought that O'Neill got sent back\ndown in February but I must have been given incorrect info. I guess\nthis says it all about Moncton because Barnes is still one of\ntheir top 3 or so scorers even though he's been out since January.", "\n\nI didn't see any smilies in this message so.......\n\n                W     T    L    PTs\n   Team A      50    30    4    104\n   Team B      52    32    0    104\n\n\nThere you go.  Two teams that tie in points without identical records.\n\n"]
# find topics similar to a key term/phrase: 
# (a separate variable name keeps the `topics` list from fit_transform() intact)
similar_topics, similarity_scores = model.find_topics("sports", top_n=5)
print("Most similar topics: " + str(similar_topics)) # view the numbers of the top-5 most similar topics

# print the top words of the most similar topics
for topic_num in similar_topics: 
    print('\nContents from topic number: ' + str(topic_num) + '\n')
    print(model.get_topic(topic_num))

Most similar topics: [0, 30, 6, 166, 4]

Contents from topic number: 0

[('team', 0.007645058778587724), ('games', 0.006112662299637617), ('players', 0.005412026399964582), ('season', 0.005342811826876292), ('hockey', 0.005239065199444112), ('league', 0.004280045353200042), ('teams', 0.003990602953367509), ('baseball', 0.0037812052034601833), ('nhl', 0.003514144827427642), ('gm', 0.0029900018153221084)]

Contents from topic number: 30

[('games', 0.03260548961663573), ('sega', 0.02366315012814771), ('arcade', 0.012166539858844822), ('snes', 0.010883627526511617), ('sega genesis', 0.01081910740506706), ('joysticks', 0.010294764495945618), ('games sale', 0.010085068481475858), ('sale', 0.00964091677280479), ('joystick', 0.009006639792149954), ('sega cd', 0.0074012373591723)]

Contents from topic number: 6

[('riding', 0.011792240692170709), ('ride', 0.011256591323418531), ('driving', 0.007418204752466058), ('road', 0.007362304673149508), ('traffic', 0.006971330162717447), ('roads', 0.005093305390738552), ('bikes', 0.0046328368271995445), ('bikers', 0.0041220512073587194), ('riders', 0.0037367046265679754), ('passengers', 0.0035386604055364823)]

Contents from topic number: 166

[('religion', 0.024810151190057972), ('war', 0.01958713595572545), ('wars', 0.0141305144151792), ('crusades', 0.012827683749926261), ('history', 0.01202363443416338), ('religious', 0.009458363539211138), ('unbelievers', 0.008338773663764506), ('yoked unbelievers', 0.007970064155940823), ('statement religion', 0.007495172035922859), ('gods', 0.0071255212864334274)]

Contents from topic number: 4

[('health', 0.0072259305085357), ('cancer', 0.005975505039095839), ('disease', 0.00513078203584376), ('tobacco', 0.005069613472607038), ('medical', 0.00492433353954727), ('hiv', 0.004709304265420622), ('malaria', 0.004112010029452724), ('smokeless tobacco', 0.004033769948845448), ('lyme', 0.003923377448522405), ('medical newsletter', 0.003903230753928965)]
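Searches like the one above return topics ranked by a similarity score. As a general illustration of ranking by similarity, and not a description of BERTopic's internals, the sketch below ranks a few made-up three-dimensional vectors by cosine similarity to a query vector. All names and numbers here are assumptions invented for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings", for illustration only.
query_vector = [0.9, 0.1, 0.0]
topic_vectors = {
    "hockey": [0.8, 0.2, 0.1],
    "encryption": [0.0, 0.1, 0.9],
    "video games": [0.5, 0.5, 0.2],
}

# Sort topic names from most to least similar to the query.
ranked = sorted(
    topic_vectors,
    key=lambda name: cosine_similarity(query_vector, topic_vectors[name]),
    reverse=True,
)
print(ranked)  # ['hockey', 'video games', 'encryption']
```

This also suggests why loosely related topics can show up in the results: the ranking always returns the closest candidates, even when none of them is a strong match for the query.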

Saving/loading models:#

  • One of the most obvious drawbacks of using the BERTopic technique is the algorithm’s run-time. But, rather than re-running a script every time you want to conduct topic modeling analysis, you can simply save/load models!

# save your model: 
# model.save("TAML_ex_model")
# load it later: 
# loaded_model = BERTopic.load("TAML_ex_model")

Visualizing topics:#

  • Although the prior methods can be used to manually examine the textual contents of topics, visualizations can be an excellent way to succinctly communicate the same information.

  • Depending on the visualization, it can even reveal patterns that would be much harder/impossible to see through textual analysis - like inter-topic distance!

  • Let’s see some examples!

# Create a 2D representation of your modeled topics & their pairwise distances: 
model.visualize_topics()
# Get the words and probabilities of top topics, but in bar chart form! 
model.visualize_barchart()
# Evaluate topic similarity through a heat map: 
model.visualize_heatmap()

Conclusion#